AdEase Time Series

AdEase is an ads and marketing company that helps businesses elicit maximum clicks at minimum cost. It provides an ad infrastructure to help businesses promote themselves easily, effectively, and economically. The interplay of three AI modules - Design, Dispense, and Decipher - makes it an end-to-end, three-step digital advertising solution for all.

You are working in the Data Science team of AdEase, trying to understand the per-page view report for different Wikipedia pages over 550 days and to forecast the number of views, so that you can predict and optimize ad placement for your clients. You are provided with data on 145k Wikipedia pages and the daily view count for each of them. Your clients belong to different regions and need data on how their ads will perform on pages in different languages.

Data Dictionary:

There are two CSV files given.

train_1.csv: In this file, each row corresponds to a particular article and each column corresponds to a particular date. The values are the number of visits on that date.

The page name contains data in this format:

SPECIFIC NAME _ LANGUAGE.wikipedia.org _ ACCESS TYPE _ ACCESS ORIGIN

having information about the page name, the main domain, the device type used to access the page, and the request origin (spider or browser agent).
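Such a page string can be split with a small regular expression; a sketch (the regex and the `parse_page` helper are ours, not part of the dataset):

```python
import re

# Sketch of splitting the Page string into its four parts; the regex and the
# parse_page helper are ours, not part of the dataset.
PAGE_RE = re.compile(r"^(.*)_([a-z]{2})\.wikipedia\.org_([^_]+)_([^_]+)$")

def parse_page(page):
    """Return (title, lang, access_type, access_origin) for wikipedia.org
    pages, or None for wikimedia.org / mediawiki.org pages."""
    m = PAGE_RE.match(page)
    return m.groups() if m else None

print(parse_page("2NE1_zh.wikipedia.org_all-access_spider"))
# -> ('2NE1', 'zh', 'all-access', 'spider')
```

Non-matching pages (the wikimedia.org and mediawiki.org URLs) come back as `None`, which also makes them easy to drop later.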

Exog_Campaign_eng: This file contains data for the dates which had a campaign or significant event that could affect the views for that day. The data is just for pages in English.

There's a 1 for dates with campaigns and a 0 for the remaining dates. It is to be treated as an exogenous variable when training and forecasting models for pages in English.

Additional views:

In this case study, our focus is to analyze and forecast user views for various categories of Wikipedia pages; our scope is limited to using language as the category.

We will begin our solution with pre-processing activities such as aggregation of duplicate pages, removal of non-Wikipedia pages (including wikimedia.org and mediawiki.org), and extraction of title, access_type, access_orig, and language features from each page. After that, we analyze missing values for each page. Broadly, we categorize missing values (or NaNs) into three groups: leading NaNs (the page didn't exist at that point in time), trailing NaNs (discontinued pages), and in-between NaNs (genuine missing values). We impute the missing values in these categories using different approaches, as described later. As the last pre-processing step, we aggregate page views at the language level.

We then analyze the various time series, plot their graphs and ACF/PACF functions, check their stationarity (Dickey-Fuller test), and apply d-order differencing and/or seasonal differencing to make them stationary. In the final section, we build time series models using ARIMA, SARIMA/SARIMAX, and Prophet. As part of SARIMAX modeling, we create a custom wrapper sklearn estimator to use GridSearchCV for hyperparameter tuning. We will use MAPE as the evaluation/performance metric.
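Since MAPE is the evaluation metric throughout, a minimal helper may be sketched as follows (the `mape` name is ours):

```python
import numpy as np

# Minimal sketch of the MAPE metric used throughout; the helper name is ours.
def mape(y_true, y_pred):
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return np.mean(np.abs((y_true - y_pred) / y_true)) * 100

print(round(mape([100, 200, 400], [110, 180, 400]), 2))  # -> 6.67
```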

Imp Notes:

  1. In this case-study, we will mainly perform time series analysis for 'en' language pages.

  2. We will create utility functions for hyperparameter tuning, best-parameter selection (based on the least-MAPE criterion), final model building, and plotting forecasts. Thus, for each language, we can run all the steps with a single function call. We will, however, not build sklearn pipelines, as there are not many pre-processing steps involved here.

Solution

Data import and analysis

Process 'Page'

Observation: We see that all the non-wikipedia.org URLs are either wikimedia.org or mediawiki.org URLs. We will not consider them in the further analysis and drop them.

Check for duplicate pages

Observations: There are several duplicate page names. We will aggregate each set of duplicates into a single row by summing their visit counts.
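The aggregation can be sketched with a pandas groupby; the toy frame below assumes the train_1.csv layout of one 'Page' column plus one column per date:

```python
import pandas as pd

# Sketch of collapsing duplicate page names by summing their visit counts,
# assuming the train_1.csv layout (a 'Page' column plus one column per date).
df = pd.DataFrame({
    "Page": ["A_en.wikipedia.org_all-access_spider"] * 2,
    "2015-07-01": [10, 5],
    "2015-07-02": [3, None],
})
# min_count=1 keeps a NaN only when every duplicate is NaN for that date.
dedup = df.groupby("Page", as_index=False).sum(min_count=1)
print(dedup.shape[0])  # -> 1
```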

Extract title, lang, access type, and access origin

Observation: These are again mediawiki.org or wikimedia.org URLs. We can remove them from further analysis.

Analyze time series for a few pages

Check stationarity

Check autocorrelation and partial autocorrelation plots to determine seasonality (if any)

Observation:

  1. Most of the randomly selected pages have stationary time series.

  2. In the non-stationary time series, we see small to moderate spikes in ACF and PACF plots at lag-7 for some pages. So for further decomposition, we can consider weekly seasonality.

Observations: For stationary time series, we do not see any trend, as expected. For non-stationary time series, there is no clear general pattern in the trend. However, we do see a weekly seasonality effect.

Missing value analysis

Observation: The visualization above shows that the majority of missing values occur as chunks in the farthest past (the pages likely didn't exist at that point). As we move towards the more recent past, the number of missing values comes down. Also, there are some time series where NaNs occur frequently. We can now compute relevant stats to further understand the distribution of missing values.

Observations:

  1. There are a total of 19,103 pages with one or more missing values. We can further segregate missing values (or NaNs) into the following categories.

    1. Leading NaNs - the consecutive sequence of NaNs farthest back in time. A large enough leading-NaN count signifies that the page didn't exist in that time frame.

    2. Trailing NaNs - the consecutive sequence of NaNs in the most recent past. A large enough trailing-NaN count signifies that the page may have been removed/discontinued/renamed at some point in the past.

    3. Between NaNs - genuine NaN values occurring while the page was in existence.

  2. Around 530 pages have a trailing_nan value > 500. Similarly, there are 839 pages with a trailing_nan value > 30.

Removing discontinued pages

All pages with missing values for the last 30 days or more (that is, trailing_nan >= 30) can be considered discontinued and safely removed from our analysis. For pages with values missing for the last 7 to 30 days, we can compute the probability of being discontinued as shown below.

P_discontinued = 1 - (P_nan ^ trailing_nan)

where P_nan = the probability of a specific time-series value being NaN (≈ avg_nan_prop)

All pages where P_discontinued > 0.95 can also be safely removed.
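The rule above can be sketched as follows (the function and variable names are ours):

```python
# Sketch of the discontinuation rule above; names are ours.
def p_discontinued(p_nan, trailing_nan):
    """Probability that `trailing_nan` consecutive trailing NaNs are not
    explained by the page's ordinary NaN rate `p_nan`."""
    return 1 - p_nan ** trailing_nan

# A page missing 10% of days at random, now with 21 trailing NaNs:
print(p_discontinued(0.10, 21) > 0.95)  # -> True, safe to remove
```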

Remove sparse pages with very high total NANs or very high NAN rate

We can remove the pages which satisfy the following conditions.

# Assuming `pages` is a per-page stats DataFrame with
# 'avg_nan_prop' and 'total_nan' columns:
rem_list = pages.index[
    (pages['avg_nan_prop'] > 0.5)
    | ((pages['total_nan'] > 350) & (pages['avg_nan_prop'] > 0.3))
]
pages = pages.drop(rem_list)

Missing value imputation

Now that we have removed discontinued and sparse pages, in this section we can impute the missing values.
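One plausible imputation scheme is sketched below: interpolate the in-between NaNs and zero-fill the leading ones. The exact choices here are an assumption, not a quote of the actual code:

```python
import numpy as np
import pandas as pd

# One plausible imputation scheme (an assumption; the exact choices are not
# quoted here): interpolate in-between NaNs, zero-fill leading NaNs.
s = pd.Series([np.nan, np.nan, 10, np.nan, 14, 12])
s = s.interpolate(limit_area="inside")  # fills only NaNs between valid values
s = s.fillna(0)                         # leading NaNs -> page did not exist
print(s.tolist())  # -> [0.0, 0.0, 10.0, 12.0, 14.0, 12.0]
```

`limit_area="inside"` is what keeps the interpolation from touching the leading chunk, so the two categories get their separate treatments.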

Aggregate time series at language level

Aggregated Time series analysis (for each language)

Basic visualization

Effect of campaigns

Observation: We can observe that campaigns for 'en' pages had a positive effect on total user visits. The increase in visits is 55%, an absolute increase of 161 million visits. All other languages, except 'ru', show no significant impact from the campaigns run for 'en'. Interestingly, 'ru' pages show a ~86% increase on 'en' campaign days. Since we do not have campaign details for other languages, it is quite possible that similar campaigns ran on the same days for 'ru' as well. In the absence of this information, we will not use the 'en' campaign information for 'ru' models.
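The campaign lift can be measured as the ratio of mean views on campaign days to mean views on other days; a sketch with toy numbers standing in for the aggregated series:

```python
import pandas as pd

# Sketch of measuring campaign lift on a language-level series; the toy
# numbers below stand in for the aggregated 'en' views and campaign flag.
views = pd.Series([100, 102, 160, 98, 155, 101])
campaign = pd.Series([0, 0, 1, 0, 1, 0])

lift = views[campaign == 1].mean() / views[campaign == 0].mean() - 1
print(f"{lift:.0%} increase on campaign days")
```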

Stationarity, decomposition, and differencing

Observations:

Time series Modeling

train-test split

Utility functions

Hyperparameter tuning functions

ARIMA family of models

ARIMA model

SARIMA model

SARIMAX model with campaigns exogenous variable

Observations:

For the ARIMA model, the best MAPE score is 6.5%. For the SARIMA and SARIMAX models, the best MAPE score is 4.1%.

Forecasting using Prophet

Observations:

  1. For the 'en' time-series, we got a MAPE of 5.8%.

  2. The lowest MAPE, 5%, was recorded for the 'zh' time-series; the highest, 16.5%, for 'ru' pages. This is expected: 'ru' pages show a very large increase in visits during 'en' campaign days, and since we consciously did not add the campaign effect to any non-English pages, the MAPE is high.

Questionnaire

1. Defining the problem statement, and where can this and modifications of this be used?

AdEase is an ads and marketing company helping businesses elicit maximum clicks at minimum cost. The problem statement is as follows: the ask is to understand the per-page view report for different Wikipedia pages across regions and to forecast the number of views, so that AdEase can predict and optimize ad placement for its clients, who belong to different regions. Since data at the individual-page level is sparse, we aggregate views at the language level and forecast there. These forecasts can then potentially be redistributed to individual pages to identify the pages likely to get maximum views. Advertisements should be placed on such pages to increase their reach and garner maximum clicks.

2. Write 3 inferences you made from the data visualizations

  1. Time series data at the individual-page level is often sparse. When we checked a small sample of individual pages, most of them had stationary time series. The ones with non-stationary time series showed weekly seasonality effects.

  2. After aggregating data at the language level, all aggregated time series became non-stationary. All of them became stationary after applying first-order differencing and/or weekly seasonal differencing. We could confirm this through ACF/PACF plots and stationarity tests.

  3. We observed that 'en' campaigns had a positive impact on the number of views on 'en' pages; the average increase was around 55%. Interestingly, the number of views on 'ru' pages increased around 86% during the 'en' campaign days. This could be for several reasons. First, campaigns designed for 'en' pages may also have reached 'ru' viewers. Second, similar campaigns may have been running on the same days for 'ru' pages, for which we do not have data. Third, there could be another confounding variable impacting both 'en'/'ru' pages and campaign days. Since we do not know for sure, in our models for 'ru' pages we have chosen not to use the 'en' campaign data as an exogenous variable.

3. What does the decomposition of series do?

The decomposition of time series is a statistical task that deconstructs a time series into several components, each representing one of the underlying categories of patterns. The components are trend, cycle, seasonality, and noise (irregular components).

4. What level of differencing gave you a stationary series?

For most language pages, first order differencing or weekly seasonal differencing was sufficient to obtain stationary series.

5. Compare the number of views in different languages

As we can see, English by far has the highest average total daily visits (across pages) and also the highest number of daily visits per page.

6. What other methods other than grid search would be suitable to get the model for all languages?

Using auto_arima (from the pmdarima library) is one option to reduce the hyperparameter tuning effort for the ARIMA family of models. Another potential approach is to use boosted regression trees (XGBoost or LightGBM) and explicitly pass various lagged time variables, lagged seasonal variables, a language indicator, and exogenous variables such as campaigns. The expectation is that the tree algorithm identifies the most important variables (which should include language) and effectively fits per-segment regressions at the leaf level. This approach may prove faster and more scalable, since a single model may suffice for all languages' time series.